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Abstract 

Recent technological improvements in the field of genetic data extraction give rise to the possibility of reconstructing the 
historical pedigrees of entire populations from the genotypes of individuals living today. Current methods are still not 
practical for real data scenarios as they have limited accuracy and assume unrealistic assumptions of monogamy and 
synchronized generations. In order to address these issues, we develop a new method for pedigree reconstruction, 
PREPARE, which is based on formulations of the pedigree reconstruction problem as variants of graph coloring. The new 
formulation allows us to consider features that were overlooked by previous methods, resulting in a reconstruction of up to 
5 generations back in time, with an order of magnitude improvement of false-negatives rates over the state of the art, while 
keeping a lower level of false positive rates. We demonstrate the accuracy of PREPARE compared to previous approaches 
using simulation studies over a range of population sizes, including inbred and outbred populations, monogamous and 
polygamous mating patterns, as well as synchronous and asynchronous mating. 
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Introduction 

Pedigree reconstruction is an important problem in the field of 
computational genetics, with many potential applications such as 
genealogy inference, heritability estimation, and victim identifica- 
tion [1—4]. Additionally, it has the potential to improve the 
accuracy of current state-of-the-art relationship inference methods 
as it uses family structure in a broader sense than just using 
pairwise genetic similarity information. [5,6]. There are two main 
variants of the problem, which require different algorithmic 
approaches. In the first variant, considered by many classical and 
contemporary papers, the genotypes of several generations are 
given, and an attempt is made to estimate the pedigree which best 
explains the observed individuals, as might be the case in wild 
animal populations. [7-10]. In this paper we consider a more 
difficult variation of the problem, where we are given the 
genotypes of the currently living population only, and try to 
reconstruct the historical pedigree of unobserved ancestors. This 
variant suits well the scenario of reconstructing the pedigrees of 
living human populations. [11]. This variant of pedigree 
reconstruction was previously studied in several theoretical works 
[12,13]. These papers focus on presenting theoretical bounds on 
the length of sequence required for reconstructing pedigrees under 
various combinatorial and stochastic heritability models, but in 



contrast to our work, do not aim to provide practical solutions for 
the problem. 

The level of difficulty of the problem is highly dependent on the 
pedigree in consideration. Particularly, small inbred populations 
pose a considerable challenge since the probability for multiple 
mating events within any two families is high, and therefore 
individual pairs usually have more than two last common 
ancestors (LCAs). Moreover, in small inbred populations there is 
a complex relationship pedigree graph due to mating within the 
family. 

Recently, three methods tackling pedigree reconstruction from 
the genotypes of extant individuals were proposed [1 1,14]; these 
methods assume monogamy, and synchronized generations. 
Although unrealistic, these assumptions provide a starting point 
for developing tools that offer useful methodology. The original 
paper addressing pedigree reconstruction from the genotypes of 
extant individuals, presented the methods COP/ CIP [11]. COP 
assumes infinite population size, and CIP tries to reconstruct the 
pedigree of small inbred populations. IP ED is a follow-up method, 
similar in principal to C/P, but with improved efficiency [14] . The 
main idea behind these methods is to construct the pedigree, 
generation at a time, starting with the given population. In each 
generation they identify sibling groups using genetic similarity 
measures, and assign two common parents to each sibling group. 

In this work, we point out an important and naturally arising 
issue of pedigree reconstruction from extant populations, over- 
looked by all previous methods. We observe that the mother and 
father of a sibling-group have exactly the same descendants (as 
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Author Summary 

Learning the correct relationships between individuals 
from genetic data is a basic theoretical problem in the field 
of genetics, and has many practical consequences. A wide 
variety of statistical methods for genetic analysis assume 
the relationships between individuals are known, and can 
manifest relatedness information to improve inference. 
The current state-of-the-art methods for relationship 
inference consider pair-wise genetic similarity, and use it 
to infer the relationship between each pair of individuals. 
Reconstructing the pedigrees of an entire population 
directly has the potential to use more elaborate relation- 
ship information, and thus obtains a better prediction of 
the familial relationships in the population. In contrast to 
the full set of pair-wise relationships in a population, 
genetic pedigrees provide a lossless and conflict-free 
structure for depicting the relationships between individ- 
uals. In an effort to make pedigree reconstruction practical 
we developed a new method, which is an order of 
magnitude more accurate than previous methods, and is 
the first method that has the ability to reconstruct 
polygamous pedigrees. 



must be the case for monogamous couples). Since the genotypes of 
the parents are unobserved, a pairwise relationship analysis relying 
on the extant descendants will result in maternal relatives having 
the same likelihood of being related to the mother and to the 
father, and vice versa (see Fig. 1). Thus, partitioning the relatives 
into maternal and paternal relatives is required. Undoubtedly, 
ignoring this issue has a great potential influence on the quality of 
inferred pedigrees. We discuss a new framework to help 
understand and correctly deal with this issue, and present a highly 
efficient algorithm under this framework - PREPARE (Pedigree 
Reconstruction of Extant populations using PArtitioning of 
RElatives). We extend our method to the case of polygamous 
pedigrees, and show that our approach results in a considerable 
improvement in accuracy compared to existing tools, both on 
monogamous and polygamous pedigrees. Thus, PREPARE 
presents a method that is capable of dealing with more realistic 
pedigree reconstruction problem as compared to previous 
methods. 

Methods 

Similarly to previous methods, we reconstruct the pedigree 
generation by generation, starting with the last generation, and 
assuming all of the genotypes of the population come from the 
same generation. In iteration k, we take the partial k—l 
generations pedigree, which we call Pk-i, and build Pk by 
adding parents to all of the founder individuals in P&-1- In order 
to construct the correct pedigree, full-siblings should have two 
common parents in the pedigree, and half-siblings should have a 
single common parent. First, we attempt to detect all founder- 
individual pairs in Pk-i which are most likely to be full-siblings, 
leaving the detection of half-sibling to a later stage. In previous 
methods, a sibling graph G = (V,E) is constructed, where V 
includes the set of all founders in Pk-i, and E corresponds to the 
set of pairs of individuals that are likely to be full siblings. Pairs of 
individuals are considered as potential siblings based on the 
genetic similarity of the pair's extant descendants. Sibling groups 
are then detected by finding maximum cliques or proper vertex 
coloring of the graph G. This approach is problematic, since 
individuals with equivalent descendant sets, such as parent 



couples, are completely indistinguishable in the graph G since 
they have exactly the same set of neighbors. As a result, the siblings 
graph includes many redundant edges, and fails to represent the 
true relationship structure. 

In contrast with previous methods, we present an alternative 
graph representation that accounts for the above-mentioned 
ambiguity, and uses the transitive property of the full-sibling 
relationship to correctly find the full-sibling groups. We begin each 
iteration by constructing a contracted siblings graph G' = {V\E'). 
The set of vertices V is composed of disjoint subsets of V. 
Particularly, each VeV corresponds to a subset of V, so that for 
each vi,V2ev' we have Desc(v\) = Desc(v2), where Desc(v) 
represents the set of extent descendants of v (see Fig. 2). Since 
vertices of G' correspond to subsets of V, we refer to vertices in V 
as super-vertices. The set of edges E' corresponds to potential 
sibling relationship between the corresponding super-vertices, i.e., 
(v',u')eE' if there are veV,ueu' such that (v, u)eE. Note that in 
such case, for every vev\ueu', we will have (v, u)eE. Edges have 
weights WeE'^U, representing the confidence of the relationship. 
For a vertex v', we define contract(v) = V for every veV. We 
provide the details for the construction of the set E' and W in 
section 2.1. 

The key idea of our method lies in a procedure for the 
assignment of the edges in G' to edges in G in a consistent way. In 
principle, we are interested in assigning every super-edge (u',v')eE' 
to an edge (u,v)eE that corresponds to the true sibling pair among 
all pairs in (v',t/). In doing so, we need to take into consideration a 
set of constraints on the assignments of neighboring super-edges. 
Ideally, we would like to find the assignment of super-edges to the 
edges of G, which maximizes the likelihood of the observed 
population genotypes. In section 2.2, we formulate this problem as 
an optimization problem using graph terminology, and propose a 
greedy algorithm which solves it in practice. The assignment 
algorithm generates an expanded siblings graph G*=(V,E*), 
where E* <= E, denotes the proposed full-sibling pairs, and forms a 
disjoint clique-cover of the graph. 

Under the monogamy assumption, we finish reconstructing the 
current generation by adding two common-parents to each sibling 
clique in G*. In order to account for potential polygamy we add 
another step that identifies half-siblings and incorporate these into 
a second graph formulation. Our approach for the reconstruction 
of polygamous pedigrees relies on two key observations. First, we 
note that we can treat the full-sibling relation as an equivalence 
relation, and the half-sibling relation as a relation between 
equivalence classes. This is true, since if a and b are full siblings, 
and a and c are half- siblings, then b and c are also half-siblings. 
According to this observation, we construct a half sibling graph 
Gp = (Vp,Ep) where Vp corresponds to the equivalence classes 
defined by the full-sibling relation, and Ep correspond to the half- 
sibling relation. Second, we observe that the children of every 
parent in the founder group of Pk correspond to a clique in Gp. 
We formulate the half-sibling detection problem, as a second 
graph optimization problem. To solve it, we develop a heuristic 
algorithm which attempts to find the maximal-weighted set of 
edges in Gp. The edge set has to satisfy a set of constraints, which 
represent natural constraints that govern half-sibling relation- 
ships. (see section 2.3). 

2.1 Constructing the Contracted Sibling Graph 

We now describe the construction of the graph G' = {V',E'). 
Recall that the set of super-vertices V consists of subsets of V that 
share the same set of extant descendants. For every pair 
(v',u')eV'x V we have to decide whether (V,u')eE'. In order to 
do so, we pick a representative pair (v,w), where 
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Original pedigree We observe the last 2 generations. All parent pairs have 

Genotypes are observed for the the same likelihood 
I — I children only. 




Figure 1. Attempting to reconstruct the simple pedigree on the left, from the genotypes of extant generation (bright blue). 

Considering observed genetic similarity of extant descendants only, we cannot distinguish which of the four parents in the second generation are 
siblings (Correctly inferred sibling relationship are colored blue, and wrong potential sibling-relationships in dashed red). 
doi:1 0.1 371 /journal.pcbi.1 00361 0.gOOl 



v G v',u G u', and calculate three scores, corresponding to 
three putative relations of v and u: unrelated, siblings, and cousins. 
For each such relationship r, we construct a pedigree P r (u,v) by 
adding the relevant ancestry structure. For example, when 
considering the siblings relationship we construct P S iblings( u >v) by 
adding two common parents for v and u. For unrelated pairs we 
construct P U nrelated(u,v) by adding a different pair of parents to 
each node (see Fig. 3). 

We proceed by simulating inheritance on P r (u,v); that is, the 
founders in P r (u,v) are assigned unique haplotypes and we 
simulate the recombination process from top to bottom, with a 
recombination rate of 10~ 8 . We then calculate IBD segments 
between each pair of extant descendants in Desc(u) and Desc(y) 
and calculate two IBD features'. The number of IBD segments, and 
the total length of IBD sharing (we note that these features of IBD 
sharing were also considered by ERSA [15], a method for the 
inference of pair- wise family relationships). We repeat these 
simulations L times for a specified parameter L, thus obtaining an 
empirical estimate for the distribution of the IBD features. Using 
the above empirical distributions, we estimate the probability of 
observing the IBD features for each pair in Desc(v) x Desciu) 
under the relationship r. Since the observed IBD features are 
typically not observed in any of the L simulations, we use a 
smoothed form of the distribution using Gaussian kernel 



smoothing. Formally, let X r \ , . . . ,X r ^ : X ri eU be the simulated 
IBD features in the L simulations for a hypothesized relationship r. 
The density f r (x) at point x is calculated as: 



/,(*)= 



2n-L 



E 



X ri Y B- V (x-X ri ) 



Empirical tests led us to the conclusion that scaling the features 
to have equal variance and using a diagonal bandwidth matrix 
B = [) I with a parameter in the range 1 to 8 gives the best 
results. The parameter L compensates running time and accuracy. 
The accuracy stops improving near L — 50, which ends up with a 
very efficient analysis (See section 2.4 for more details). 

Let IBD a fiER 2 be the observed IBD features between extant 
individuals a and b. The above procedure results in a probability 
f r (IBD a b), for every aeDesc(u),beDesc(v) and every relationship r 
in {siblings, cousins, unrelated} . 

For each relationship r, we define 



score r (u,v)= IT f r (IBD a b) 

aeDesc(u),beDesc(v),a^b 



A. 



B. 



© 





D. 




©0 0©, 



Figure 2. Four examples of vertex contractions, typical for first, second, and third generations. Founders are filled with Grey. Extant 
individuals are outlined in blue. Green arrows stand for the contraction action. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.g002 
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Figure 3. Examples for possible ancestry structures created for individuals u and v in order to test the relationship between them. 

The triangles under u and v represent their existing descendants, edges represent parent-offspring relationship. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.g003 



We note that score r (u,v) can be intuitively interpreted as a 
composite likelihood of r. If score S iblings( u > v ) is larger than 
score unre i ate d(u,v) and score cousins (u,v) we add (u',V) to E with 
the weight 

, score S ibUn g s(u,v) 
2 J v score r (u,v) 



Fig. 4 shows the distribution of under different 

2 j v score r (u,v) 

true relationships. Notice that cases where u,v are distantly related 
(cousins, 2nd-cousins etc.) will tend to have a maximal score under 
score C0US i ns (u,v). This is desirable, since we only seek to distinguish 
siblings from non-siblings at this point. 



2.2 The Assignment Algorithm 

In the assignment stage, we are given the contracted siblings 
graph G = (V ,E f ), and we search for an assignment of a sibling 
relation between super-vertices, depicted by an edge (u',v')eE' to a 
single sibling-relation between two individuals (u,v)eE. Our 
assignment needs to obey the transitivity constraint of the full 
sibling relation. Recall that the weight of an edge w(u\v') 
corresponds to the strength of evidence for the existence of a 
sibling pair (w,v), where ueu',vev'. We therefore formulate the 
edge assignment problem as follows: 

Problem 1. Maximum weight disjoint clique cover edge 
assignment. Given the contracted graph G = (V ,E'), find the 
maximal-weight set of edges E* <= E, such that E* is a legal 
assignment of E f , under the constraint that the set of assigned 
edges E* forms a clique cover of the graph G = (V,E), i.e., E* is 
composed of an edge-disjoint set of cliques. 

We first show that the above problem is NP-hard: 



score 



ns / Sum(score r ) distribution in full siblings 

r 




0.3 
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Full-Cousins 



0.4 0.5 0.6 0.7 

/ Sum(score ) distribution in full cousins 
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score Ha|f Sjb|jngs / Sum(score r ) distribution in half siblings 
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0.7 



0.8 



0.9 



Figure 4. Distribution of relationship scores under specific true relationships. 

doi:1 0.1 371 /journal.pcbi.1 00361 0.g004 



PLOS Computational Biology | www.ploscompbiol.org 



June 2014 | Volume 10 | Issue 6 | e1003610 



Pedigree Reconstruction via Relative Partitioning 



True relationships 




G = (V,E) 




G'=(V',E') 




G*=(V,E*) 



\0/ &C'& 

y 



Figure 5. Intuition for sibling assignment, depicting the potential-siblings graph G, the contracted graph G ' , and assigned graph 

G*. In both examples (u,x),(y,y),(w,z) are parent couples with extant descendants in the observed population. A. For the case where (u,z),(w,v) are 
full-siblings, the contraction will end in G' composed of three super-vertices, connected by two edges; the assignment algorithm will assign each 
edge to a disjoint clique. B. If (u,v) are also full-siblings, a 3-clique is formed in G'; the assignment algorithm assigns all edges to a corresponding 3- 
clique of siblings. 

doi:1 0.1 371 /journal.pcbi.1 00361 0.g005 



Theorem 1. The maximum weight disjoint clique cover edge 
assignment is NP-hard. 

Proof. We will show a reduction from maximum clique. In [16] it 
is shown that it is NP-hard to decide whether a graph G = (V,E) 
has a clique of size n l ~ £ or if its largest clique is smaller than n, 
where =0.01. Consider an instance G = (V,E) to the clique 
problem, and let C be its largest clique. We define G' = (V\E f ), 
where V = V, and E' = E. Thus, any clique cover of G is a legal 
assignment of G' . Note that if \C\ >n l ~ £ then the optimal clique 

cover is necessarily of size at least — - — . On the other hand, if 

|C|<?2 e then it is easy to see that the optimal clique cover is 
obtained in case all cliques in the cover are of size n e , and thus the 

clique cover size is of size at most - . Thus, if the Maximum 

Weight Disjoint Clique Cover Edge Assignment was polynomial, 
then we could decide in polynomial time between the case where 
the maximum clique is of size n and the case where the maximum 
clique is of size n 1 ~ e , which is an NP-hard problem. 



We therefore apply the following greedy algorithm. We will 
need to introduce a few notations. First, we treat vertices VeV as 
vertices in G', as well as subsets of V, depending on the context. 
For each xe V, we denote by Ne* (x) the set of neighbors of x in 
E*. Moreover, we define Ne'(x) = {contract(y)\yeNE*(x)}, i.e., 
the set of super- vertices corresponding to the neighbors of x in E* . 
Finally, let N 0 = {xeV\ \N E *(x)\ =0}. 

We start by setting E* = 0. The algorithm proceeds by 
traversing all super-edges (u\v')eE' in decreasing weight order. In 
each iteration the set E* consists of a set of disjoint cliques of G, 
and E' consists of a set of yet to be assigned edges. For each 
veiVofV and ueu' we say that v can be added to the clique of u if 
for every x'eNe'(u) we have that (x',v')eE'. Similarly, we say that 
ueNqC\u' can be added to the clique of veV if for every x'eNe'(v) 
we have (x',u')eE f . When traversing an edge (u',v f ) we search for 
a pair (u,v) where u has the maximal clique size, \Ne*(u)+ 1|, 
from within u f , veNqC^v', and v can be added to the clique of u 
(or in a symmetric manner that u can be added to the clique of v 
and |7Ve*(v)+1| is maximized). We then assign (u\V) to (u,v) by 



Original pedigree 




Figure 6. An example for the construction of G P in the first generation. 

doi:1 0.1 371 /journal.pcbi.1 00361 0.g006 
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Table 1. Sensitivity and PPV scores (as defined in the results section) of half-siblings using two coloring order schemes. 





PREPARE 




naive 




Population size Sensitivity 


PPV 


Sensitivity 


PPV 


200 1 .0 


0.91 


0.91 


0.91 


300 0.91 


0.88 


0.79 


0.85 


400 0.97 


0.88 


0.85 


0.88 


500 1 .0 


0.88 


0.88 


0.91 


(1) PREPARE's greedy coloring scheme as described in section 2.3. (2) Coloring cliques from the heaviest to lightest; if possible color with F Pl else if possible color with 



M P . 

doi:1 0.1 371/journal.pcbi.1 00361 0.t001 



adding (u,v) to E*, and removing (u',v f ) from E' . We also assign 
(x',v') to (x,v) for every xeNe(u). 

Fig. 5 summarizes the contraction and assignment stages with 
an example. Note that cases such as 3-cliques in G (Fig. 5-B) can 
have multiple assignments with the same score (3 siblings from one 
parent couple, or 3 pairs of siblings from 3 different parent 
couples). In such cases our algorithm chooses the more parsimo- 
nious solution in which there is a smaller number of parents. 

2.3 Half-sibling Detection 

In the following stage we define the half-sibling detection 
problem, where we attempt to detect groups of individuals with a 
single common-parent. First, we define the full-sibling relation, on 
individuals: FS = {(u,v)\u,v have two common parents}. Notice 
that FS is defined as being reflective, and thus it is an equivalence 
relation on V. V/FS is the quotient set of V on FS, which in this 
case is simply the set of disjoint groups of full-siblings. We obtain 
FS from the edges in E* computed in section 2.2. E* is a clique 
cover, and so naturally describes an equivalence relation. 

We define HS = {([u] FS , [v] FS ) \ V(u,v)e[u] FS x [v] FS , u, v share 
exactly one common parent}, which is the half-sibling relation, as 
a relation between equivalence classes in V, in respect to FS. 
Assuming the pedigree is known, HS is defined properly since if u 
and v are full siblings, and u and x are half-siblings, than v and x 
are half siblings. This allows us to simplify the half-sib detection 
problem, by constructing the polygamy graph Gp = (Vp,E F ), 



where Vp = V/FS s.t each vertex veVp, represents a group of full- 
siblings, and each edge (u,v)eEp represents a half-sibling relation 
between u and v (see Fig. 6). The edges are added to Ep, with a 
similar stage to 2.1, only the hypotheses tested this time are made 
for siblings groups (u,v)eEp, and are relevant to the half-sibling 
case (half-siblings, cousins,unrelated) . 

The graph Gp has the convenient property that if a group of 
individuals {v\,...,Vk} V have a single-common-parent then 
{[ v l] v[ v ^]} — Vp f° rm a clique in Gp. We thus assume by 
parsimony, that each clique {u\,...Uk} in Gp connects all of the 
children of a single parent w, such that each UiE{u\,...Uk} is a full- 
sibling-group which contains the children of w and a single mate. 
We therefore formulate the half-sib detection problem, as follows: 

Problem 2. Maximum weight, two-color clique 
cover. Given the graph Gp = (Vp,Ep), find sets of edges 
Fp,Mp^Ep, such that both Fp and 
edge-disjoint set of cliques, FpC\Mp- 
of Fp and Mp is maximized. 

Theorem 2. The Maximum Weight Two Color Clique Cover is JVP- 
hard. 

Proof. We will show a reduction from maximum clique. Consider 
an instance G = (V,E) to the clique problem, and let C be its 
largest clique. If |C|>?z e we can set M=C and F = 0, and 
therefore the optimal solution to the coloring problem has at least 



Mp consist of an 
and the total weight 



edges. On the other hand, if | C\ <n e then the size of each of 





Figure 7. An example for a case where the coloring order we purpose enables coloring more cliques with two colors than coloring 
the same graph with an arbitrary order. The coloring order is depicted near the cliques. In the left graph we follow the depicted order and color 
the clique blue if possible, else we color it dashed-orange. The fourth click cannot be colored since it touches a blue and a dashed-orange clique. In 
the right graph we use our coloring scheme, which prefers coloring cliques touching cliques that are already colored. Using this order we are able to 
color all four cliques. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.g007 
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Original pedigree 



Potential full-sibs in G' 
for generation 2 




Original pedigree 



Potential half-sibs in G n 




for generation 2 




B. Original pedigree Potential half-sibs in G p 

for generation 2 




T 



Original pedigree 




Potential half-sibs in G p 
for generation 2 




Figure 8. Depicting cases where edge removal rules are required in polygamous pedigree reconstruction. Redundant graph edges are 
dashed red, correct edges in solid black. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.g008 



M and P is at most — — , and thus the total size of both of them is 
2 

bounded by n 1 + £ <n 2 ~ 2e /2. Thus, by solving the Maximum 
Weight Two Color Clique Cover in polynomial time we can 
decide between graphs with clique size at most n e and graphs with 
clique size at least n l ~ £ , hence the problem is NP-hard. 

Informally, we try to color all edges Ep in two colors, Fp and 
Mp, s.t each color creates a set of disjoint cliques. Fp colored 
cliques, represent full-sibling-group cliques with a single common 
father, and Mp colored cliques, represent full-sibling-group cliques 
with a single common mother. 

This problem is also NP-hard and we therefore use the following 
greedy approach. For simplicity, we assume Gp is connected. The 
algorithm begins by setting Fp = Mp = 0. We will denote by 
V{F P ) and V{M P ) the set of vertices induced by F P and M P 
respectively. The algorithm proceeds in iterations. In each 
iteration we search for the heaviest clique C<= Vp\V(Fp) such 
that CnV(M P )=£0, and the heaviest clique C^V P \V(M P ) 
such that C(~\V{Fp)i^0 . Without loss of generality, assume that 
the heaviest among those is a clique C in Vp\V(Fp). If C contains 
only one vertex, we search instead for the heaviest clique C in 
V P \{V{Fp)\J V(M P )). We add the edges of C to F P and remove 



these edges from the graph. Clearly, both Fp,Mp consist of a set of 
disjoint cliques of Gp. 

Notice that we try to minimize the number of arbitrarily 
colored cliques, by choosing cliques adjacent to cliques that are 
already colored. Simulation studies show that choosing this 
coloring order increases the half-sibling sensitivity from 85% to 
97% on average (see table 1). It is easy to see that sub-graphs that 
are composed of a connected list of cliques will be colored 
optimally by our coloring scheme. An example for such a graph is 
depicted in Fig. 7. 

The graph formulation of the half-sibling detection assumes that 
each edge in Ep represents a unique half-sibling relationships. We 
notice, that in some cases Gp might contain redundant edges. In 
order to simplify the explanation, we extend the definition of 
Desciu) to nodes in Gp: Descp(u) = (J VG ^{* \xeDesc(v)}. The 

problem arises, when there exists a pair of nodes u,v from the same 
generation, such that Descpiu) <= Descpiy). In such a case, an edge 
(u,x) may be added to Ep, as a result of a relationship (v,x). 
Trying to contract u and v is not sound, since different 
relationships can be detected for u, and v to a third vertex x, by 
testing them separately. Instead, we apply a preprocessing to Gp, 
in the form of a set of parsimonious rules. The rules aim at filtering 



Table 2. Running times of PREPARE on 1.6GHz Intel Core i5-2467M machine with 4G RAM using a single thread. 





Population Size 


monogamous 


polygamous 


100 


31s 


4m 18s 


200 


53s 


9m 21s 


500 


4m 55s 


56m 40s 


1000 


10m 27s 


93m 41s 



The two parameters affecting the running time of prepare is the population size, and whether PREPARE is run on monogamous or polygamous mode. Most of the 
running time is spent on reconstructing the fifth generation. 
doi:1 0.1 371/journal.pcbi.1 00361 0.t002 
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Consensus-accuracy score (popSize=200) Sensitivity score (popSize=200) 




Figure 9. Example for the problematic nature of the consensus-accuracy score, in contrast with the sensitivity score we propose. 

Notice how the unrelated pedigree structure receives similar consensus-accuracy scores to IPED and CIP reconstructions. Still, PREPARE scores 
are significantly higher. Shown are average scores over 5 simulations, and standard deviation bars. (Some error bars are too small to be visible). 
doi:1 0.1 371 /journal.pcbi.1 00361 0.g009 



all the edges, except the ones that explain the observed features in 
the simplest way. 

The first rule we apply concerns the case depicted in Fig. 8-A. 
In this case, an individual a, with a half-sibling b, has children with 
two mates x, and y. Since a,b,x,y do not have full siblings, each of 
them is represented in Gp as a sibling-group of one individual. 
Since x and y have children only with a, their descendant sets are 
contained in a's descendant set. As a result, half-sibling edges 
should form between (x,b) and (y,b), additionally to the correct 
edge (a,b). To deal with this case, if we find a node a, in Vp that 
has two mates, x,y and the following holds: 
(x,b)eE P AND (a,b)eE P , we remove (x,b) (we do the same for 
(y,b)). A similar rule is applied to the contracted graph G', where 
redundant full-sibling edges result from an equivalent case to the 
one just mentioned, and are removed in the same manner (see 
Fig. 8-B). A third rule is applied to Gp to deal with a case similar to 



the one in rule 1 , only the mates x,y are not the mates of a single 
individual a, but instead x is the mate of a, y is the mate of b and 
a,b are full-siblings (see Fig. 8-C). In such a case, a true relation 
(b,c) may cause redundant half-sibling edges (x,c),(y,c). These 
cases are characterized by mates x,y that have few or no full- 
siblings. Thus, we look for edges (a,c),(x,c) where 
IMissI < IMfsI' sucn mat x * s tne mate of a, and remove (x,c) 
from Gp. Finally, we observed half-sibling edges forming between 
two mates (x,y), of (a,b) such that (a,b) are full-siblings. This 
results from the fact that most of a and b's descendant similarity 
was already explained by the formation of the full-sibling 
relationship (a,b). The difference between the half-sibling 
hypothesis and the null hypothesis for (x,y) becomes small. As a 
result, noisy decisions are made. To handle this final case, we 
remove half-sibling edges between mates of full siblings (a,b) if 
they have a half-sibling edge (x,y) in Gp(see Fig. 8-D). 



—♦—IPED —■—CIP —A— PREPARE Optimal 



Sensitivity (popSize=200) PPV (popSize=200) RMSIBDE (popSize=200) 




-A 



5 



—♦—IPED -B-COP — PREPARE Optimal 



Sensitivity (popSize=1000) PPV (popSize=1000) RMSIBDE (popSize=1000) 




2 3 4 5 2 3 4 512345 

Generation Generation Generation 



Figure 10. Comparison of pedigree reconstruction methods for monogamous populations, using Sensitivity, PPV, and RMSIBDE, 
Populations were simulated with Wright-Fisher simulations of 5 generation. Shown are average scores over 5 simulation, with standard deviations 
bars. The optimal RMSIBDE score is calculated by scoring the true k-generation pedigree. The first generation pedigree in the RMSIBDE figures, is 
the score of the pedigree where all individuals are unrelated, and is shown as reference. (Some error bars are too small to be visible). 
doi:10.1371/journal.pcbi.1003610.g010 
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Figure 11. Comparison of pedigree reconstruction methods for polygamous populations. Populations were simulated with polygamous 
Wright-Fisher simulations of 5 generation. Shown are average scores over 5 simulation, with standard deviations bars. (Some error bars are too small 
to be visible). 

doi:1 0.1 371 /journal.pcbi.1 00361 0.gOl 1 



2.4 Efficiency Considerations 

Simulating inheritance for the descendants of every two 
individuals during the graph constructions is very time consuming, 
and is the reason CIP is impractical for large populations, or 
pedigrees deeper than 4 generations. Notice that if a pair of extant 
descendants has exactly the same ancestor structure in the 
pedigree, than the simulated IBD features are sampled from the 
same distribution. IPED purposes caching individual pairs with 
identical inheritance paths, and introduces an accompanying 
dynamic programming algorithm for minimizing the number of 
operations. 

In PREPARE, we use a simplified version of this idea. For 
every pair (a,b) of extant descendants, we calculate a least- 
common-ancestors (LCAs) vector LCAs(a,b), which is a list of the 
meiosis distances between (a,b) and their least common ancestors. 
For example, all full-siblings will have the LCAs(a,b) = [1,1], since 
full-siblings always have two common ancestors, with one 
separating meiosis. We hash the simulated distribution for this 
LCA vector, where the key represents the vector itself, and the 
value is the distribution. We simulate inheritance only when 
needed, i.e. when u',v'eV have at least one descendant pair, 
without a hashed distribution, thus saving most of the redundant 
computation. Practically, the running time of PREPARE is 
equivalent to the running time of IPED, and is even slightly faster 
(see Table. 2). Although LCAs(a,b) does not capture completely 
the ancestry structure for (a,b), we observed empirically (data not 
shown) that running simulations for each ancestry structure does 
not improve the reconstruction accuracy. Apparently, pairs of 
individuals (a,b) with the same LCAs vector have similar IBD 
distributions. The similarity is large enough to make the repetition 
of inheritance simulation for two such pairs redundant. 

2.5 Availability 

The PREPARE method, inheritance simulators, and quality 
evaluation tools are available at http://www.cs.tau.ac.il/ ~heran/ 
cozy gene/ software, shtml 



Results 

We compare the accuracy of our method to previous pedigree 
reconstruction methods on numerous simulations. Different 
simulations include combinations of population size and inheri- 
tance modes (monogamous and polygamous). Smaller population 
sizes correspond to inbred populations with multiple relationships 
between families. Larger populations correspond to outbred 
populations, with simpler pedigree structures. We also study the 
effect of population bottlenecks on the reconstruction quality. In 
order to test PREPARE on a more realistic scenario, we run it on 
a realistic simulation starting from HapMap phaselll CEU and 
YRI populations as founders. The simulation simulates polyga- 
mous random mating in this population for 200 years, reaching to 
a final population size of 1000. Finally, we apply PREPARE on 
the HapMap MEX population as a feasibility test for application 
of our method for real populations. 

3.1 Simulations 

Similarly to previous methods, we use a Wright-Fisher (WF) 
simulator that includes recombination and genders. We add 
several new features, which makes this simulator more flexible. 
First, we add the ability to control polygamy through a polygamy 
probability parameter p, which controls the probability for an 
individual to have a child with more than one mate. Second, we 
add an option to simulate dynamic population sizes by specifying 
an initial population size and a final population size. The simulator 
calculates the required population change per generation and 
modifies the population size with that ratio in every generation. 

Additionally, we experiment with a more realistic forward 
simulator that does not assume synchronized generations, and 
allows polygamy. We simulate inheritance as a function of time, 
where individuals can have children after the age of 20, and die at 
an age drawn from a capped exponential distribution with mean 
50. The birthrate is changed according to the current population 
size, and is tuned to reach a predefined target population size. This 
simulator produces actual recombined haplotypes, from the 
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IBD features for synchronized polygamous population 
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Figure 12. Simulated IBD feature distribution in monogamous and polygamous populations. The overlap in polygamous distributions is 
the main challenge in reconstructing pedigrees of real populations. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.gOl 2 



haplotypes of 160 CEU and YRI HapMap representatives. More 
specifically, the simulation runs in 5 year iterations, and a pool of 
unmated mature individuals is maintained at all times. Every 
iteration, individuals from the pool are matched to uniformly 
drawn mates. A matching has probability m p to succeed. Every 
mated pair has a probability pb to have a child, where pb is 
initialized to be 1, and is modified in every iteration by +0.2 or -0.2 
depending on whether the current population size is smaller or 
larger than the target population size. Polygamy is achieved 
through second-marriage, which can occur since once a mate dies, 
the individual is added back to the unmated pool. Finally, in order 
to include possible IBD detection errors, we detect IBD segments 
from simulated genotypes using GERMLINE, [17], and extract 
the IBD-features information from its output. This simulator also 
has the advantage of having a possible dynamic population size. 



The population grows or shrinks depending on the initial and 
target population sizes. 

3.2 Quality Evaluation 

Many different measures can be accounted in evaluating the quality 
of reconstructed pedigrees. We first use a previously defined score, to 
compare PREPARE to previous methods. For the large part of the 
presentation, we define and use other natural evaluation scores, which 
we deem as more relevant, and interpretable. In previous methods, a 
consensus-accuracy score, which counts the number of extant 
individual-pairs with the same minimal meiosis-distance as in the true 
pedigree was used [14]. This score treats correct detection of unrelated 
pairs and related pairs identically. This is problematic since the number 
of unrelated pairs dominates the score. For example, a trivial algorithm 
that outputs a pedigree where all individuals are unrelated receives a 
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Figure 13. The performance of PREPARE on realistic simulation is comparable to polygamous Wright-Fisher simulations. The 

simulated population grew from 1 60 individuals of the CEU and Y RI HapMap populations to 846 individuals in 200 years. This simulation accounts 
for IBD detection errors, asynchronous mating and dynamic population size. 
doi:1 0.1 371 /journal.pcbi.1 00361 0.gOl 3 



high consensus-accuracy score (see Fig. 9). As a new standard for 
pedigree-reconstruction evaluation, we suggest three types of scores: 
sensitivity, positive-predictive-value (PPV), and IBD-length prediction 
error. 

We define sensitivity as the fraction of correctly constructed 
(distance wise) related pairs from the total number of related pairs in 
the original pedigree. PPV is defined as the fraction of correctly 
constructed related pairs from the total number of related pairs in the 
reconstructed pedigree. More formally, define R as the reconstructed 
pedigree, O as the original pedigree, Dp(iJ) as the minimal number 
of meiosis separating i and j in pedigree P, and RelExtantp as the set 
of extant-individuals, which are related according to pedigree P. Let 



root mean square IBD-length error (RSMIBDE): 



TP 



R,0 = 



E 



iJeRelExtnato 



l(D R (iJ) = D 0 (iJ))) 



Then, 



Sensitivity z 



TP 



R,0 



\RelExtanto\ 



,PPV R , 0 - 



TP R ,o 



\RelExtantji\ 



We run PREPARE for G generation, and compare the scores 
of reconstructed pedigrees for every generation ke{\...G} 
against the first k generations of the original pedigree. This 
way we can assess the accuracy of different relatedness degrees 
(k = 2 corresponds to siblings, k = 3 to siblings and first-cousins, 
etc.) 

Scores such as sensitivity and PPV have the disadvantage of not 
weighing mistakes according to their magnitude. A second 
disadvantage is that the minimal meiotic distance does not capture 
the full complexity of a real pedigree (for example, double cousins 
detected as cousins will get a full scoring). For these reasons, we 
suggest to alternatively measure pedigree quality by calculating the 



RMSIBDE = / 

iJeExtant 



(IBD R (ij)-IBD 0 (ij)f 



where Extant is the set of extant individuals in the population, 
IBDo is the observed total length of IBD segments between 
individuals i and j, and IBDr is the total length of IBD segments 
between individuals i and y, as given from simulating inheritance 
on the reconstructed pedigree R. Since this score is dependent on 
the randomized scoring-simulation, we average the score of 5 runs. 
The RSMIBDE can be interpreted as the expected prediction 
error (in Mbp) of the typical pair-wise total-IBD-length, given the 
reconstructed pedigree. 

3.3 Comparing PREPARE and Competing Methods on 
Monogamous Simulations 

We tested the competing methods on monogamous Wright- 
Fisher simulated population, of constant sizes: 100, 200, 500, and 
1000. When it was possible, we ran CIP (up to 4 generations due 
to its high runtime complexity), and for larger populations we ran 
COP. PREPARE was run in monogamous mode. Results on 100 
and 200 individuals were similar, as well as results for 500 and 
1000 individuals. In Fig. 10, we compare the three methods for 
small populations (200) and larger populations (1000). In all the 
scenarios we tested, PREPARE was the most sensitive; for 
pedigrees of up to 5 generations (corresponding to 3rd cousins) and 
populations as small as 100 individuals. For the larger populations, 
the improvement in sensitivity is highest, where PREPARE is able 
to build a pedigree which correctly predicts the minimal meiosis 




Figure 14. PREPARE successfully isolates the 4 generation pedigree found by CARROT. Nodes correspond to individuals, and edges to 
parent offspring relationships. The last generation individuals are real HapMap individuals, and the other nodes are ancestors predicted by PREPARE. 
doi:10.1371/journal.pcbi.1003610.g014 
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distance of more than 95% of 1st and 2nd degree relatives and 
more than 60% of relatives up to 3rd degree. At the same time, 
PREPARE has a higher PPV up to pedigrees of 4 generations. In 
the 5th generation it gets a lower PPV than the other methods, but 
this disadvantage is not meaningful, since the sensitivity of these 
methods in the 5th generation is very low. PREPARE gives better 
quality of results for larger populations, which is natural, since they 
tend to form simpler pedigrees with less multi-relationships 
between families, and less inbred families. 

Considering RMSIBDE scores, PREPARE gets much better 
scores than the second best method, and is close to the optimal score, 
especially for larger populations. IP ED gets worse RMSIBDE 
scores than CIP/ COP as a result of its practical tendency to over- 
predict inbreeding, which we observed during our experiments. An 
important feature of PREPARED score is that it is non-increasing in 
the number of generations, similarly to the optimal score. In 
contrast, we do not see this behavior in other methods. Interestingly, 
the optimal scores decrease as the population size increases. We 
attribute this mainly to the increasing proportion of unrelated pairs 
in larger populations, which are easier to predict. 

3.4 The Effect of Population Expansion on the Success of 
Pedigree Reconstruction 

The simplified Wright-Fisher model that was used in pedigree 
reconstruction methods up to this day assumes a constant 
population size. Real populations sizes are obviously not constant, 
and it is known that population bottlenecks and expansion affect 
the IBD distribution in the population. We have conducted an 
experiment to test the effect of population size shifts on the 
distribution of chosen IBD features, and as a consequence on the 
quality of the resulting pedigree. We have run the Wright-Fisher 
simulation with changing initial population sizes of 
100,200,300,400,500 and fixed the final population size at 500. 
By looking at the distribution of IBD features between all pairs of 
individuals, it is clear to see that the number of IBD segments and 
the mean IBD segment length have an inverse relationship with 
the initial population size. This corresponds to a higher proportion 
of relatives in the populations with smaller initial size. We have 
found that populations that grow from 100 to 500 individuals in 
five generations have similar IBD feature distributions to 
populations with constant population size of size 200. Interestingly 
the quality of the resulting pedigree of these populations remains 
unchanged when the initial population size is gradually decreased 
from 500 to 200. Only at initial size of 100 does the quality 
decrease. Sensitivity levels for initial population size of 100 are 
0.96,0.75, and 0.54 for 2,3 and 4 generations. The largest decrease 
is for 3 -generation pedigrees where the sensitivity is decreased by 
10% on average. The PPV remains above 0.95 for generation 2,3 
but is decreased from 0.85 to 0.71 in generation 4. 

3.5 Comparing PREPARE and Competing Methods on 
Polygamous Simulations 

To asses the quality of PREPARE on polygamous populations, 
we simulated polygamous populations of sizes 200 and 1000 with 
the Wright-Fisher model. In the simulated populations 33% of the 
siblings are half-siblings on average. Details regarding the 
execution of previous methods are the same as in section 3.3. 
PREPARE was run with the polygamous mode. The results are 
summarized in Fig. 1 1 . Once again PREPARE is generally 
superior in terms of sensitivity, PPV and RMSEIBD. A notable 
exception is IPED's relatively high sensitivity in generations 4 and 
5 in smaller population sizes (200). Note however that this 
sensitivity comes at the cost of very low PPV and very high 
RMSEIBD in these generations. The RMSEIBD of IP ED is not 



shown in the graph since it is out of the charts, getting as high as 
1500 Mbp. This result suggests that IPED has a strong tendency 
to over-predict relationships in small polygamous populations. 

Similarly to the monogamous case, PREPARE achieves higher 
performance on larger, and as a result, more simply related 
populations. For a population size of 1000, PREPARE is able to 
build a polygamous pedigree which correctly predicts the minimal 
meiosis distance of more than 97% of 1st degree relatives and 
more than 80% of 2nd degree relatives while maintaining a PPV 
greater than 80%. Polygamous populations pose a much greater 
challenge for pedigree reconstruction, and the performance is 
decreased in comparison to monogamous populations. According 
to our analysis, the difficulty in reconstructing polygamous 
pedigrees stems from the fact that the IBD feature distributions 
for the range of possible polygamous relationships have greater 
overlap than in monogamous relationships (See Fig. 12). 

3.6 Reconstructing Realistically Simulated HapMap 
Descending Population 

We test the performance of PREPARE on populations 
produced by the polygamous, asynchronous forward simulator. 
We run the simulator for hundreds of simulation years, resulting in 
the mixing of the different generations, and reconstruct the last five 
generations. We use un-phased IBD segments, to account for the 
fact that our input is genotypes, and not haplotypes. As a necessary 
step, we aim to filter out cross-generation relationships, which are 
not currently modeled, by taking the genotypes from the youngest 
age stratum (Ages 0-20). We used the CEU and YRI HapMap 
genotypes as the founder population for our simulation. The 
results show a comparable success to the Wright-Fisher simulation, 
increasing our confidence that PREPARE can be run on real 
populations. All accuracy measures show a decrease in accuracy 
compared to the Wright-Fisher simulation results. This is expected 
due to the addition of several factors (as discussed above), which 
adds to the complexity of the analysis (see Fig. 13). 

3.7 Application for the HapMap MEX Population 

We next use PREPARE to reconstruct the historical pedigree 
for the HapMap MEX population. This population is of interest to 
us since it is known to contain several relatives, including a single 
4-generation pedigree [5]. Age information is not publicly 
available for this dataset. Instead, we use known parent-offspring 
relationships to separate the population into three generations. 
The correct pedigree is not known, so we use previous relationship 
inference results by Stevens et al. to validate our results [18]. 

Running PREPARE on the parent generation of HapMap 
phasell+III MEX genotypes, we are able to detect a single sibling 
relationship (NA19662,NA 19685), three first-cousin relationships 
(NA19662,NA19664), (NA19664,NA 19685), (NA19657,NA19786) 
and two second-cousin relationships (NA19657,NA19785), 
(NA19785,NA19786). We are able to reconstruct correctly the 
pedigree found by Kyriazopoulou et al. We do this fully 
automatically and without using the genotypes of the two known 
grandparents: (NA19662,NA 19685) which makes the reconstruction 
a significantly harder task(see Fig. 14). Further more, all of the 
relationships inferred by PREPARE except (NA19785,NA19786) 
are confirmed by Stevens et al.[18]. (NA19657,NA19786) are 
inferred as Third degree instead of first cousins, and 
(NA 1 965 7,NA 19785) as Unknown degree instead of second cousins. 

Discussion 

In this paper, we take a step towards making pedigree 
reconstruction from present living populations, a realistic objective. 
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By developing better quality assessment tools, we were able to come 
to the conclusion that our method reconstructs pedigrees with 
significantly higher quality then previous methods, and in compa- 
rable running times. PREPARE is the first method to our 
knowledge to address polygamy, and paternal/maternal relative 
partitioning. Although we succeed partitioning the relatives, there is 
no way to know which relatives are really related to the father, and 
which to the mother by considering autosomal data alone. We are 
not worried about this lack in specificity, as we do not strive to learn 
the ancestral genders. Instead, we are interested in inferring the 
pedigree structure, which provides the relatedness structure. Our 
graph framework, brings to the surface several ambiguous cases that 
cannot be solved without utilizing additional subtle information. For 
example, the assignment of a 3-clique (see Fig. 5-B) might be 
decided better by considering three-way IBD sharing. The chance 
of having triple IBD sharing diminishes much faster than the chance 
of pair-wise IBD sharing and limits the theoretical possibility to 
correctly reconstruct these cases in advanced generations. Recon- 
structing inbred relationships correctly remains an unmet challenge 
by all methods in the present. It seems that an approach to deal with 
inbreeding will need to utilize additional inbreeding imprints on the 
data, such as homozygosity levels and other IBD-features not used 
today. Additionally, current methods do not include inbreeding 
options in the hypothesis testing stage, which might lead to the 
wrong conclusions when inbreeding exists. Despite the above, our 
method is able to reconstruct high quality pedigrees by dealing 
correctly with the most frequently arising cases in randomly mating 
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populations. We believe that improving the performance on such 
rare aspects will probably have a small impact on the pedigree 
quality. More importantly, in order to further improve the 
reconstruction quality of polygamous populations, it seems that a 
better set of IBD features needs to be found, with higher separating 
power between different relationship types. Theoretically, the size of 
a family can influence the scores of its founders since larger families 
will contribute more extant individuals to the score computation. 
Simulating populations with differing typical family sizes show little 
effect on the quality of reconstruction. The current PREPARE 
method can be applicable for real populations, with the setback that 
only a specific age-range must be taken as input, such that most 
inter-generation relationships will be excluded. 
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